Goto

Collaborating Authors

 fitted q-iteration


Finite-Time Bounds for Average-Reward Fitted Q-Iteration

Neural Information Processing Systems

Although there is an extensive body of work characterizing the sample complexity of discounted-return offline RL with function approximations, prior work on the average-reward setting has received significantly less attention, and existing approaches rely on restrictive assumptions, such as ergodicity or linearity of the MDP. In this work, we establish the first sample complexity results for average-reward offline RL with function approximation for weakly communicating MDPs, a much milder assumption. To this end, we introduce Anchored Fitted Q-Iteration, which combines the standard Fitted Q-Iteration with an anchor mechanism. We show that the anchor, which can be interpreted as a form of weight decay, is crucial for enabling finite-time analysis in the average-reward setting. We also extend our finite-time analysis to the setup where the dataset is generated from a singletrajectory rather than IID transitions, again leveraging the anchor mechanism.


Finite-Time Bounds for Average-Reward Fitted Q-Iteration

Neural Information Processing Systems

Although there is an extensive body of work characterizing the sample complexity of discounted-return offline RL with function approximations, prior work on the average-reward setting has received significantly less attention, and existing approaches rely on restrictive assumptions, such as ergodicity or linearity of the MDP. In this work, we establish the first sample complexity results for average-reward offline RL with function approximation for weakly communicating MDPs, a much milder assumption. To this end, we introduce Anchored Fitted Q-Iteration, which combines the standard Fitted Q-Iteration with an anchor mechanism. We show that the anchor, which can be interpreted as a form of weight decay, is crucial for enabling finite-time analysis in the average-reward setting. We also extend our finite-time analysis to the setup where the dataset is generated from a single-trajectory rather than IID transitions, again leveraging the anchor mechanism.


A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

Neural Information Processing Systems

In off-policy policy evaluation (OPE) tasks within reinforcement learning, Temporal Difference Learning(TD) and Fitted Q-Iteration (FQI) have traditionally been viewed as differing in the number of updates toward the target value function: TD makes one update, FQI makes an infinite number, and Partial Fitted Q-Iteration (PFQI) performs a finite number. We show that this view is not accurate, and provide a new mathematical perspective under linear value function approximation that unifies these methods as a single iterative method solving same linear system, but using different matrix splitting schemes and preconditioners. We show that increasing the number of updates under the same target value function, i.e., the target network technique, is a transition from using a constant preconditioner to using a data-feature adaptive preconditioner. This elucidates, for the first time, why TD convergence does not necessarily imply FQI convergence, and establishes tight convergence connections among TD, PFQI, and FQI. Our framework enables sharper theoretical results than previous work and characterization of the convergence conditions for each algorithm, without relying on assumptions about the features (e.g., linear independence). We also provide an encoder-decoder perspective to better understand TD's convergence conditions, and prove, for the first time, that when a large learning rate doesn't work, trying a smaller one may help(for batch TD). Our framework also leads to the discovery of new crucial conditions on features for convergence, and shows how common assumptions about features influence convergence, e.g., the assumption of linearly independent features can be dropped without compromising the convergence guarantees of stochastic TD in the on-policy setting. This paper is also the first to introduce matrix splitting into the convergence analysis of these algorithms.


Finite-Time Bounds for Average-Reward Fitted Q-Iteration

arXiv.org Artificial Intelligence

Although there is an extensive body of work characterizing the sample complexity of discounted-return offline RL with function approximations, prior work on the average-reward setting has received significantly less attention, and existing approaches rely on restrictive assumptions, such as ergodicity or linearity of the MDP. In this work, we establish the first sample complexity results for average-reward offline RL with function approximation for weakly communicating MDPs, a much milder assumption. To this end, we introduce Anchored Fitted Q-Iteration, which combines the standard Fitted Q-Iteration with an anchor mechanism. We show that the anchor, which can be interpreted as a form of weight decay, is crucial for enabling finite-time analysis in the average-reward setting. We also extend our finite-time analysis to the setup where the dataset is generated from a single-trajectory rather than IID transitions, again leveraging the anchor mechanism.


Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference

arXiv.org Machine Learning

Double reinforcement learning (DRL) enables statistically efficient inference on the value of a policy in a nonparametric Markov Decision Process (MDP) given trajectories generated by another policy. However, this approach necessarily requires stringent overlap between the state distributions, which is often violated in practice. To relax this requirement and extend DRL, we study efficient inference on linear functionals of the $Q$-function (of which policy value is a special case) in infinite-horizon, time-invariant MDPs under semiparametric restrictions on the $Q$-function. These restrictions can reduce the overlap requirement and lower the efficiency bound, yielding more precise estimates. As an important example, we study the evaluation of long-term value under domain adaptation, given a few short trajectories from the new domain and restrictions on the difference between the domains. This can be used for long-term causal inference. Our method combines flexible estimates of the $Q$-function and the Riesz representer of the functional of interest (e.g., the stationary state density ratio for policy value) and is automatic in that we do not need to know the form of the latter - only the functional we care about. To address potential model misspecification bias, we extend the adaptive debiased machine learning (ADML) framework of \citet{van2023adaptive} to construct nonparametrically valid and superefficient estimators that adapt to the functional form of the $Q$-function. As a special case, we propose a novel adaptive debiased plug-in estimator that uses isotonic-calibrated fitted $Q$-iteration - a new calibration algorithm for MDPs - to circumvent the computational challenges of estimating debiasing nuisances from min-max objectives.


Fitted Q-iteration in continuous action-space MDPs

Neural Information Processing Systems

We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by another policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous theoretical analysis of this algorithm, proving what we believe is the first finite-time bounds for value-function based algorithms for continuous state- and action-space problems.


Fitted Q-iteration by Advantage Weighted Regression

Neural Information Processing Systems

Recently, fitted Q-iteration (FQI) based methods have become more popular due to their increased sample efficiency, a more stable learning process and the higher quality of the resulting policy. However, these methods remain hard to use for continuous action spaces which frequently occur in real-world tasks, e.g., in robotics and other technical applications. The greedy action selection commonly used for the policy improvement step is particularly problematic as it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we show that by using a soft-greedy action selection the policy improvement step used in FQI can be simplified to an inexpensive advantage-weighted regression. With this result, we are able to derive a new, computationally efficient FQI algorithm which can even deal with high dimensional action spaces.


Fitted Q-iteration in continuous action-space MDPs

Neural Information Processing Systems

We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by another policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous theoretical analysis of this algorithm, proving what we believe is the first finite-time bounds for value-function based algorithms for continuous state- and action-space problems. Papers published at the Neural Information Processing Systems Conference.


Fitted Q-iteration by Advantage Weighted Regression

Neural Information Processing Systems

Recently, fitted Q-iteration (FQI) based methods have become more popular due to their increased sample efficiency, a more stable learning process and the higher quality of the resulting policy. However, these methods remain hard to use for continuous action spaces which frequently occur in real-world tasks, e.g., in robotics and other technical applications. The greedy action selection commonly used for the policy improvement step is particularly problematic as it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we show that by using a soft-greedy action selection the policy improvement step used in FQI can be simplified to an inexpensive advantage-weighted regression. With this result, we are able to derive a new, computationally efficient FQI algorithm which can even deal with high dimensional action spaces. Papers published at the Neural Information Processing Systems Conference.


Advantages and Limitations of using Successor Features for Transfer in Reinforcement Learning

arXiv.org Machine Learning

One question central to Reinforcement Learning is how to learn a feature representation that supports algorithm scaling and re-use of learned information from different tasks. Successor Features approach this problem by learning a feature representation that satisfies a temporal constraint. We present an implementation of an approach that decouples the feature representation from the reward function, making it suitable for transferring knowledge between domains. We then assess the advantages and limitations of using Successor Features for transfer.